Estimating ancestral states

Author

Daniel Padfield

Published

November 8, 2022

1 Outline

Using information on the tip states of a phylogenetic tree we can use evolutionary models to estimate the ancestral states of internal nodes and transitions between states.

We fit an \(Mk\) model of evolution to our tree and habitat preference data to understand how habitat preference evolved. For example does evolution readily allow transitions between specialist habitat preferences (i.e. from marine mud only to terrestrial only).

We also try and fit an \(Mk\) model of evolution when we combine two traits together (habitat preference and whether or not the species has high or low diversification rate as identified by BAMM).

Ancestral state reconstructions can also be estimated from the MuSSE and HiSSE models we are aiming to fit later on. These models would have a lot of free parameters and we are unsure whether they can be fit with the dataset we have. We hope to reduce the number of free parameters in those models by estimating the transition matrix which best fits the data in the \(Mk\) model. This would reduce the number of free parameters.

2 TL;DR

Go to Section 12 for the best \(Mk\) model and a discussion of the transition matrix.

3 Progress and results

  • Done ancestral state reconstructions using a variety of methods:

    • ape::ace()

    • castor::asr_mk_model()

    • diversitree::fit_mk()

    • phytools::make.simmap()

All of these use a maximum likelihood and trying to run MCMC with diversitree took an EXTREMELY long time for only 2 steps (~30 minutes) which makes this approach pretty inviable. Maybe lets try different implementations?

  • Have written a quick primer to BayesTraits but have not used it on our tree due to the time it takes to run the analysis with a transition matrix where all states are different.

  • Have compared the three standard models of evolution for the transition matrix (equal rates, symmetric rates, and all rates different) using ape::ace() and the all-rates different method is best. Then did model simplification to find the best model for the transition matrix.

  • Have plotted the ancestral state reconstructions of a subset of the tree. For the whole tree my computer gave up.

  • Have plotted the transition matrix in multiple ways, see ?@sec-best-model

4 Fitting transition rate models and doing ancestral state reconstructions

We can do ancestral state reconstruction with a couple of different methods in R as well as trying in BayesTraits.

For discrete traits, the most commonly used model for evolution on tree is called the \(Mk\) model. \(M\) stands for Markov because the modelled process is a continuous-time Markov chain, and \(k\) because the model is generalised to include an arbitrary number (\(k\)) states.

The central attribute of the \(Mk\) model is a the transition matrix, \(Q\), which gives the instantaneous transition rates between states.

An important distinction in ancestral state reconstruction for discrete characters is joint vs. marginal reconstruction. Joint reconstruction is finding the set of character states at all nodes that maximise the likelihood. Marginal reconstruction is finding the state at the current node that maximises the likelihood integrating over all the other states at all nodes, in proportion to their probability.

We can do both marginal reconstruction using ape::ace() and castor::asr_mk_model() and joint reconstruction using ape::ace().

An alternative tactic to the one outlined above is to use an MCMC approach to sample character histories from their posterior probability distribution. This is called stochastic character mapping. The model is the same but in this case we get a sample of unambiguous histories for our discrete character’s evolution on the tree - rather than a probability distribution for the character at nodes.

Finally we can use BayesTraits to do both maximum likelihood and MCMC approaches. We do not use this approach in this walk-through.

The evolutionary models that are used to reconstruct ancestral states have a transition matrix between discrete states that define the rates of transitions between states on the tree. In most of these models, there are three that are easy to define: ARD means all transition rates are different, SYM means transitions to and from the same states are the same, and ER is an equal rates model where all rates between states are the same. These transition matrices can also be custom made but for now these three seem sensible starting points. Transition matrices also underpin the MuSSE and BiSSE models we are hoping to fit.

5 Load in R packages

First we will load in R packages used and the metadata file used and wrangled in a previous walk-through.

We also load in the final phylogenetic tree, the colours used for habitat preferences, and the position of shift nodes.

Code
# load packages
library(here)
library(tidyverse)
library(ggtree)
library(ggnewscale)
library(RColorBrewer)
library(patchwork)
library(ape)
library(phytools)
library(BAMMtools)
library(coda)
library(MetBrewer)
library(ggpp)
library(castor)
library(diversitree)
library(tidygraph)
library(igraph)
library(ggraph)
library(GGally)
library(ggrepel)

# set where I am in the project
here::i_am('scripts/sequencing_rpoB/analyses/estimating_transition_rates.qmd')


# read in metadata
d_meta <- read.csv(here('data/sequencing_rpoB/processed/asv_metadata.csv'))

# read in habitat trait colours
cols_hab <- readRDS(here('data/sequencing_rpoB/processed/habitat_colours.rds'))

# read in tree
tree <- read.tree(here('data/sequencing_rpoB/bamm/rerooted-pruned-chronopl10.tre'))

# edit tree labels
d_labels <- data.frame(tip_label = tree$tip.label) %>%
  separate(., tip_label, c('part1', 'part2', 'part3'), sep = '_', remove = FALSE) %>%
  unite('tip_label_new', c(part1, part2), sep = '_')

tree$tip.label <- d_labels$tip_label_new

# read in shift nodes
shiftnodes <- readRDS(here('data/sequencing_rpoB/processed/shiftnodes.rds'))

6 Custom function

We will write a custom function for getting the transition matrix into a format that is easy to plot using ggplot2.

Code
# get dataframe of transition rates
ace_transition_df <- function(ace_object){
  
  temp_1 <- ace_object$index.matrix
  colnames(temp_1) <- rownames(temp_1) <- colnames(ace_object$lik.anc)
  
  # turn into dataframe
  temp_1 <- as_tibble(temp_1, rownames = 'state_1') %>%
    # make long format, the column names become state 2
    pivot_longer(freshwater:terrestrial, names_to = 'state_2', values_to = 'order') %>%
    mutate(free_param = ifelse(is.na(order)|order==0, 'no', 'yes'),
           num_params = length(ace_object$rates))
  
  temp_1$transition_rate <- NA
  temp_1$transition_rate[temp_1$order == 0] = 0
  
  # run for loop to input estimated transition rates
  for(i in 1:length(ace_object$rates)){
    temp_1[temp_1$order %in% i,]$transition_rate <- ace_object$rates[i]
  }
  
  return(select(temp_1, -order, state_1, state_2, transition_rate, free_param, num_params))
  
}

7 Create a master dataframe for the trait states

Different methods for estimating ancestral states take different values for the traits. Some only allow for numeric values, whereas others allow characters. Some do not allow colons in the name, and one (MBASR which uses Mr Bayes) needs the first trait value to be 0. To allow for us to semi-easily link across different methods, we will make a data frame of our unique tip states and then turn them into a numeric vector. The ordering is all done alphabetically.

Code
d_meta <- tibble(tip_label = tree$tip.label) %>%
  left_join(., d_meta)

# make sure order of habitat preference is the same in the trait vector as the tip labels
sum(d_meta$tip_label == tree$tip.label) == length(tree$tip.label)
[1] TRUE
Code
# SUCCESS if TRUE

# replace NA of outgroup - make it freshwater:terrestrial - the most common state
# phytools::make.simmap cannot take NAs
hab_pref <- setNames(d_meta$habitat_preference, d_meta$tip_label)
hab_pref[is.na(hab_pref)] <- 'freshwater:terrestrial'

# make habitat preference vector numeric
hab_pref_num <- as.numeric(as.factor(hab_pref)) -1
hab_pref_num <- setNames(hab_pref_num, d_meta$tip_label)
hab_pref_num2 <- hab_pref_num + 1

# coding from numeric to character
coding <- tibble(hab_pref = unname(hab_pref), hab_pref_num = unname(hab_pref_num)) %>%
  distinct() %>%
  mutate(hab_pref2 = gsub(':', '.', hab_pref),
         hab_pref_num2 = hab_pref_num + 1) %>%
  arrange(hab_pref) %>%
  mutate(hab_pref_axis = gsub(':', '/ ', hab_pref),
  hab_pref_axis = gsub('_', ' ', hab_pref_axis),
  # rename the columns for easy naming
  initials = c('F', 'FM', 'FT', 'G', 'M', 'MT', 'T'))

coding
# A tibble: 7 × 6
  hab_pref                  hab_pref_num hab_pref2       hab_p…¹ hab_p…² initi…³
  <chr>                            <dbl> <chr>             <dbl> <chr>   <chr>  
1 freshwater                           0 freshwater            1 freshw… F      
2 freshwater:mud_and_shore             1 freshwater.mud…       2 freshw… FM     
3 freshwater:terrestrial               2 freshwater.ter…       3 freshw… FT     
4 generalist                           3 generalist            4 genera… G      
5 mud_and_shore                        4 mud_and_shore         5 mud an… M      
6 mud_and_shore:terrestrial            5 mud_and_shore.…       6 mud an… MT     
7 terrestrial                          6 terrestrial           7 terres… T      
# … with abbreviated variable names ¹​hab_pref_num2, ²​hab_pref_axis, ³​initials
Code
# save out habitat preference vector
saveRDS(hab_pref, here("data/sequencing_rpoB/processed/hab_pref_vec.rds"))

This results in a dataframe which allows us to easily link between numeric and character values, and also gives us a vector - of the same length of tips in our tree - which we can use for estimating ancestral states.

8 Estimate ancestral states using ape::ace()

First lets fit each of the models (ER, SYM, and ARD) using ape::ace(). We can compare these models using AIC and likelihood ratio tests to see which model fits the tree best.

Code
# do ancestral reconstruction with ape
ancestral_states_ape <- ace(hab_pref, tree, model="ER", type="discrete")
ancestral_states_ape1 <- ace(hab_pref, tree, model="SYM", type="discrete")
ancestral_states_ape2 <- ace(hab_pref, tree, model="ARD", type="discrete")

We can do model comparison using both AIC scores and likelihood ratio tests.

Code
# compare models
anova(ancestral_states_ape, ancestral_states_ape1)
Likelihood Ratio Test Table
  Log lik. Df Df change Resid. Dev Pr(>|Chi|)    
1  -5274.0  1                                    
2  -4269.9 21        20     2008.2  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
anova(ancestral_states_ape1, ancestral_states_ape2)
Likelihood Ratio Test Table
  Log lik. Df Df change Resid. Dev Pr(>|Chi|)    
1  -4269.9 21                                    
2  -4188.3 42        21     163.08  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# check AIC of each model
mod_compare <- tibble(mod = c('ER', 'SYM', 'ARD'),
               aic = c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2)))
mod_compare
# A tibble: 3 × 2
  mod      aic
  <chr>  <dbl>
1 ER    10550.
2 SYM    8582.
3 ARD    8461.

From these comparisons we can see that the ARD model, where all rates in the transition rate are allowed to be different, is the best model in terms of both doing likelihood ratio tests of nested models and AIC scores.

8.1 Do some rudimentary checks of ape::ace()

We can grab the ancestral states from the output of ace(). We can do some checks which include checking whether any of the transition rates are negative or whether they have any standard errors that are NaN, and whether any of the node states have probabilities > 1. This hints the model is too complex/has not converged/not doing a good job.

We can format the transition matrix from the output of ace::ape() to look at the rates. As it is not stated in the ape::ace() directory, we assume the rates are entered as in castor::asr_mk_model(), where the [r,c]-th entry is the transition rate from state r to state c. This is confirmed on this help page.

Code
# see whether any of the SEs are NA. This hints that its not converged properly
ancestral_states_ape$se
[1] 0.007366617
Code
ancestral_states_ape1$se
 [1] 0.03892461 0.14981260 0.01579165 0.03222234 0.01275812 0.09737104
 [7] 0.04687340        NaN 0.01887849        NaN 0.04728073 0.02064626
[13] 0.04107843 0.02381328 0.67620785        NaN        NaN 0.01832579
[19] 0.01817874 0.03706546 0.02527278
Code
ancestral_states_ape2$se
 [1] 0.59684239 0.13658816 0.79849187 0.04593056 1.19351935 0.09628455
 [7] 0.14423377 0.07662671 2.52327616 0.05414065        NaN 0.09630389
[13] 0.19115756        NaN 2.81489199 0.04286403 2.94328382 0.40378628
[19] 0.03356741 0.19850905 0.04300711        NaN 0.82926874 0.05628891
[25] 0.04893926 0.39866405 0.03067986        NaN 2.97189821 0.02924411
[31] 0.05655975 0.14556329 0.05949353 0.68035092 0.07846055 0.07831938
[37] 0.13216573 0.69263824 0.25715882 6.63392690 0.04747305 3.81179275
Code
# as you can see both the symmetric and all rates different models have NaNs for some/all of the transitions

# get marginal likelihood of each ancestral state at each node
d_acr_ape <- ancestral_states_ape2$lik.anc %>%
  as_tibble(rownames = 'node') %>%
  pivot_longer(cols = all_of(coding$hab_pref), values_to = 'probability_ape', names_to = 'habitat_preference')

d_acr_ape_summary <- group_by(d_acr_ape, node) %>%
  summarise(total = sum(probability_ape), .groups = 'drop') %>%
  filter(total != 1)
nrow(d_acr_ape_summary)
[1] 7
Code
# apparently 7 nodes that do not equal 1, but they are very close to 1.

Some of the standard errors for the ARD model are NaN, suggesting convergence problems. However, there are only seven internal nodes with probabilities different from 1, so these estimates are not completely crazy.

8.2 Plotting transition matrices

We can plot the transition matrices to see what the estimated rates are. This uses our custom function ace_transition_df().

The transition matrices plotted here can be interpreted as follows:

  • the transition rates are from state 1 (on the y axis) to state 2 (on the x axis)

  • diagonal rates are when state 1 equals state 2 and are therefore not estimated

  • Red boxes indicate rates which are estimated. This will be useful when we start using custom matrices.

Code
# bind together the transition matrices from each of the standard models for the transition matrices
d_models <- bind_rows(ace_transition_df(ancestral_states_ape) %>% mutate(model = 'equal rates'),
  ace_transition_df(ancestral_states_ape1) %>% mutate(model = 'symmetric'),
  ace_transition_df(ancestral_states_ape2) %>% mutate(model = 'all rates different')) %>%
  # add in colums for reordering and naming of axes
  left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%
  left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%
  mutate(transition_rate = round(transition_rate, 2))

# plot transition matrices
# make heatmap of changes
ggplot(d_models, aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +
  geom_tile(aes(alpha = transition_rate), col = 'black', width = 0.9, height = 0.9) +
  geom_tile(aes(alpha = transition_rate), fill = NA, col = 'red', size = 1.1, filter(d_models, free_param == 'yes'), width = 0.9, height = 0.9) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(),
  legend.position = 'none',
  axis.text.x.top = element_text(angle = 90, vjust = 0.5)) +
  scale_alpha_continuous(range = c(0, 0.6)) +
  geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +
  scale_x_discrete(position = 'top', labels = scales::label_wrap(13)) +
  scale_y_discrete(position = 'left', labels = scales::label_wrap(13)) +
  labs(y = 'state 1',
  x = 'state 2') +
  coord_fixed() +
  facet_wrap(~model)

8.3 Doing model simplification on the all rates different model

So we have seen that the all rates different model is better than the SYM model, but we can also use the results of the ARD model to make some of the rates 0.

We can view the all rates different transition matrix only.

Code
# ARD transition matrix only
filter(d_models, model == 'all rates different') %>%
  ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +
  geom_tile(aes(alpha = transition_rate), col = 'black', width = 0.9, height = 0.9) +
  geom_tile(aes(alpha = transition_rate), fill = NA, col = 'red', size = 1.1, filter(d_models, free_param == 'yes' & model == 'all rates different'), width = 0.9, height = 0.9) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(),
  legend.position = 'none',
  axis.text.x.top = element_text(angle = 90, vjust = 0.5),
  plot.title.position = "plot") +
  scale_alpha_continuous(range = c(0, 0.6)) +
  geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +
  scale_x_discrete(position = 'top', labels = scales::label_wrap(13)) +
  scale_y_discrete(position = 'left', labels = scales::label_wrap(13)) +
  labs(y = 'state 1',
  x = 'state 2') +
  coord_fixed()

The transitions where the rates are estimated to be 0 are:

  • freshwater to marine mud + terrestrial
  • freshwater + terrestrial to marine mud
  • generalist to marine mud
  • marine mud to freshwater + terrestrial
  • marine mud to generalist
  • terrestrial to freshwater + marine mud
  • terrestrial to generalist

We can create a custom matrix and fix these parameters to be 0.

Code
num_states <- unique(hab_pref) %>% length()

# we will set up the custom matrix, this has 49 numbers and then we have to set the correct numbers to 0
custom_matrix <- matrix(1:num_states^2, nrow=num_states)

# make all diagonal numbers NA
for(i in 1:nrow(custom_matrix)){
  custom_matrix[i,i] <- 0
}

# make specific transitions 0 based on the output of the ARD.
custom_matrix[custom_matrix %in% c(14, 19, 26, 28, 31, 32, 36)] <- 0

# replace values of non-zeroes with correct vector of 1:n parameters
custom_matrix[custom_matrix != 0] <- 1:length(custom_matrix[custom_matrix != 0])

custom_matrix
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    7   12   17   21    0   30
[2,]    1    0   13   18   22   25   31
[3,]    2    8    0   19    0   26   32
[4,]    3    9   14    0    0   27   33
[5,]    4   10    0    0    0   28   34
[6,]    5   11   15   20   23    0   35
[7,]    6    0   16    0   24   29    0

We can now try and refit the \(Mk\) model to the tree and habitat preference data, but using our custom transition matrix.

Code
# lets try refit the model with these parameters fixed to zero
ancestral_states_ape3 <- ace(hab_pref, tree, model=custom_matrix, type="discrete")

We can now do model simplification, making sure we compare the simpler model (with fewer free parameters) to the more complex model (the all rates different model). We can also compute the AIC and add it to our table.

Code
# compare the less complicated model with the more complicated model
anova(ancestral_states_ape3, ancestral_states_ape2)
Likelihood Ratio Test Table
  Log lik. Df Df change Resid. Dev Pr(>|Chi|)  
1  -4195.0 35                                  
2  -4188.3 42         7     13.387    0.06322 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# check AIC of each model
mod_compare <- tibble(mod = c('ER', 'SYM', 'ARD', 'ancestral_states_ape3'),
               aic = c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3)))

arrange(mod_compare, aic)
# A tibble: 4 × 2
  mod                      aic
  <chr>                  <dbl>
1 ancestral_states_ape3  8460.
2 ARD                    8461.
3 SYM                    8582.
4 ER                    10550.

So the model where we explicit remove 0s from the model fitting process fits the model loads better (lower AIC, anova p value = 1 indicating the extra parameters - the ARD model - do not help explain the data any better).

Lets now look at the transition matrix from the most recent model to see what can be removed.

Code
# plot transition matrix
ace_transition_df(ancestral_states_ape3) %>%
  left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%
  left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%
  mutate(transition_rate = round(transition_rate, 2)) %>%
  ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +
  geom_tile(aes(alpha = transition_rate, col = free_param), width = 0.9, height = 0.9, size = 1.1) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(),
  legend.position = 'none',
  axis.text.x.top = element_text(angle = 90, vjust = 0.5),
  plot.title.position = "plot") +
  scale_alpha_continuous(range = c(0, 0.6)) +
  geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +
  scale_x_discrete(position = 'top', labels = scales::label_wrap(13)) +
  scale_y_discrete(position = 'left', labels = scales::label_wrap(13)) +
  labs(y = 'state 1',
  x = 'state 2',
  title = paste('all rates different with', length(custom_matrix[custom_matrix != 0]), 'free parameters', sep = ' ')) +
  coord_fixed() +
  scale_color_manual(values = c('black', 'red'))

The black boxes with 0 in are rates that we fixed to be 0. There are now no other rates that are 0. We will remove the next lowest rate to see whether its presence significantly improves the model fit. This is freshwater + terrestrial to marine mud + terrestrial.

Code
custom_matrix
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    7   12   17   21    0   30
[2,]    1    0   13   18   22   25   31
[3,]    2    8    0   19    0   26   32
[4,]    3    9   14    0    0   27   33
[5,]    4   10    0    0    0   28   34
[6,]    5   11   15   20   23    0   35
[7,]    6    0   16    0   24   29    0
Code
# input the rates into the custom matrix
custom_matrix2 <- custom_matrix

# set some more rates to 0
custom_matrix2[custom_matrix2 %in% c(26)] <- 0

# replace values of non-zeroes with correct vector of 1:n parameters
custom_matrix2[custom_matrix2 != 0] <- 1:length(custom_matrix2[custom_matrix2 != 0])

We can now re run the \(Mk\) model.

Code
# lets try refit the model with these parameters fixed to zero
ancestral_states_ape4 <- ace(hab_pref, tree, model=custom_matrix2, type="discrete")

We can do model selection as before, but now this new model (fewer parameters) is compared to the previous custom matrix model (more parameters).

Code
# compare the less complicated model with the more complicated model
anova(ancestral_states_ape4, ancestral_states_ape3)
Likelihood Ratio Test Table
  Log lik. Df Df change Resid. Dev Pr(>|Chi|)    
1  -4221.2 34                                    
2  -4195.0 35         1     52.358  4.624e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# check AIC of each model
mod_compare <- tibble(mod = c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4'),
               aic = c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4)))

arrange(mod_compare, aic)
# A tibble: 5 × 2
  mod                      aic
  <chr>                  <dbl>
1 ancestral_states_ape3  8460.
2 ARD                    8461.
3 ancestral_states_ape4  8510.
4 SYM                    8582.
5 ER                    10550.

This parameter is super important in the model. When we do model simplification using ANOVAs it says it is important and when we look at AIC scores this new model performs badly. So our best model is ancestral_states_ape3. However, other similar methods - using castor::asr_mk_model() and diversitree::fit_mk() struggle to fit the model with this transition matrix. So the results are not reproducible across very similar algorithms.

So instead we can take the approach to remove all low rates below a certain number, and then simplify from there. I remove all rates from the first custom matrix that are <0.01. This may circumvent a problem if we are at a local optimum for the transition matrix, as when we set one thing to 0 it changes the values and estimates for the other parameters.

Code
# create new custom matrix
custom_matrix3 <- custom_matrix

# get index of rates which are <0.1
which(ancestral_states_ape3$rates < 0.1)
[1]  4  6 17 19 24 26 29
Code
custom_matrix3[custom_matrix3 %in% which(ancestral_states_ape3$rates < 0.1)] <- 0

# replace values of non-zeroes with correct vector of 1:n parameters
custom_matrix3[custom_matrix3 != 0] <- 1:length(custom_matrix3[custom_matrix3 != 0])

# fit habitat preference model
ancestral_states_ape5 <- ace(hab_pref, tree, model=custom_matrix3, type="discrete")

anova(ancestral_states_ape5, ancestral_states_ape3)
Likelihood Ratio Test Table
  Log lik. Df Df change Resid. Dev Pr(>|Chi|)
1  -4189.4 28                                
2  -4195.0 35         7    -11.132          1
Code
# check AIC of each model
mod_compare <- tibble(mod = c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4', 'ancestral_states_ape5'),
               aic = c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4), AIC(ancestral_states_ape5)))

arrange(mod_compare, aic)
# A tibble: 6 × 2
  mod                      aic
  <chr>                  <dbl>
1 ancestral_states_ape5  8435.
2 ancestral_states_ape3  8460.
3 ARD                    8461.
4 ancestral_states_ape4  8510.
5 SYM                    8582.
6 ER                    10550.

So using this method, we have vastly improved the model. We can now plot these results and see if there are other parameters that can be set to zero.

Code
# plot transition matrix
ace_transition_df(ancestral_states_ape5) %>%
  left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%
  left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%
  mutate(transition_rate = round(transition_rate, 2)) %>%
  ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +
  geom_tile(aes(alpha = transition_rate, col = free_param), width = 0.9, height = 0.9, size = 1.1) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(),
  legend.position = 'none',
  axis.text.x.top = element_text(angle = 90, vjust = 0.5),
  plot.title.position = "plot") +
  scale_alpha_continuous(range = c(0, 0.6)) +
  geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +
  scale_x_discrete(position = 'top', labels = scales::label_wrap(13)) +
  scale_y_discrete(position = 'left', labels = scales::label_wrap(13)) +
  labs(y = 'state 1',
  x = 'state 2',
  title = paste('all rates different with', length(ancestral_states_ape5$rates), 'free parameters', sep = ' ')) +
  coord_fixed() +
  scale_color_manual(values = c('black', 'red'))

We can once again remove low rates (again defined as 0.1) to see if that improves the model.

Code
custom_matrix4 <- custom_matrix3

custom_matrix4[custom_matrix4 %in% which(ancestral_states_ape5$rates < 0.1)] <- 0

# replace values of non-zeroes with correct vector of 1:n parameters
custom_matrix4[custom_matrix4 != 0] <- 1:length(custom_matrix4[custom_matrix4 != 0])

ancestral_states_ape6 <- ace(hab_pref, tree, model=custom_matrix4, type="discrete")

anova(ancestral_states_ape6, ancestral_states_ape5)
Likelihood Ratio Test Table
  Log lik. Df Df change Resid. Dev Pr(>|Chi|)
1  -4180.4 25                                
2  -4189.4 28         3    -18.073          1
Code
# check AIC of each model
mod_compare <- tibble(mod = c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4', 'ancestral_states_ape5', 'ancestral_states_ape6'),
               aic = c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4), AIC(ancestral_states_ape5), AIC(ancestral_states_ape6)))

arrange(mod_compare, aic)
# A tibble: 7 × 2
  mod                      aic
  <chr>                  <dbl>
1 ancestral_states_ape6  8411.
2 ancestral_states_ape5  8435.
3 ancestral_states_ape3  8460.
4 ARD                    8461.
5 ancestral_states_ape4  8510.
6 SYM                    8582.
7 ER                    10550.

Again this model is better than the previous model by a considerable amount. Lets plot it to see what the transition matrix looks like.

Code
# plot transition matrix
ace_transition_df(ancestral_states_ape6) %>%
  left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%
  left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%
  mutate(transition_rate = round(transition_rate, 2)) %>%
  ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +
  geom_tile(aes(alpha = transition_rate, col = free_param), width = 0.9, height = 0.9, size = 1.1) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(),
  legend.position = 'none',
  axis.text.x.top = element_text(angle = 90, vjust = 0.5),
  plot.title.position = "plot") +
  scale_alpha_continuous(range = c(0, 0.6)) +
  geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +
  scale_x_discrete(position = 'top', labels = scales::label_wrap(13)) +
  scale_y_discrete(position = 'left', labels = scales::label_wrap(13)) +
  labs(y = 'state 1',
  x = 'state 2',
  title = paste('all rates different with', length(ancestral_states_ape6$rates), 'free parameters', sep = ' ')) +
  coord_fixed() +
  scale_color_manual(values = c('black', 'red'))

Ok we can see that some values are now estimated to be 0 (and none are non-zero but < 0.1). We can set these to be zero in our transition matrix and compare the models.

Code
custom_matrix5 <- custom_matrix4

custom_matrix5[custom_matrix5 %in% which(ancestral_states_ape6$rates < 0.1)] <- 0

# replace values of non-zeroes with correct vector of 1:n parameters
custom_matrix5[custom_matrix5 != 0] <- 1:length(custom_matrix5[custom_matrix5 != 0])

ancestral_states_ape7 <- ace(hab_pref, tree, model=custom_matrix5, type="discrete")

anova(ancestral_states_ape7, ancestral_states_ape6)
Likelihood Ratio Test Table
  Log lik. Df Df change Resid. Dev Pr(>|Chi|)
1  -4180.5 20                                
2  -4180.4 25         5    0.11361     0.9998
Code
# check AIC of each model
mod_compare <- tibble(mod = c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4', 'ancestral_states_ape5', 'ancestral_states_ape6', 'ancestral_states_ape7'),
               aic = c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4), AIC(ancestral_states_ape5), AIC(ancestral_states_ape6), AIC(ancestral_states_ape7)))

arrange(mod_compare, aic)
# A tibble: 8 × 2
  mod                      aic
  <chr>                  <dbl>
1 ancestral_states_ape7  8401.
2 ancestral_states_ape6  8411.
3 ancestral_states_ape5  8435.
4 ancestral_states_ape3  8460.
5 ARD                    8461.
6 ancestral_states_ape4  8510.
7 SYM                    8582.
8 ER                    10550.

This model is again better than the previous model! We can again plot the results.

Code
# plot transition matrix
ace_transition_df(ancestral_states_ape7) %>%
  left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%
  left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%
  mutate(transition_rate = round(transition_rate, 2)) %>%
  ggplot(., aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +
  geom_tile(aes(alpha = transition_rate, col = free_param), width = 0.9, height = 0.9, size = 1.1) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(),
  legend.position = 'none',
  axis.text.x.top = element_text(angle = 90, vjust = 0.5),
  plot.title.position = "plot") +
  scale_alpha_continuous(range = c(0, 0.6)) +
  geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +
  scale_x_discrete(position = 'top', labels = scales::label_wrap(13)) +
  scale_y_discrete(position = 'left', labels = scales::label_wrap(13)) +
  labs(y = 'state 1',
  x = 'state 2',
  title = paste('all rates different with', length(ancestral_states_ape7$rates), 'free parameters', sep = ' ')) +
  coord_fixed() +
  scale_color_manual(values = c('black', 'red'))

There are no longer any rates less than 0.1. Instead we will remove the lowest rate and refit the model to see if it is better or worse.

Code
custom_matrix6 <- custom_matrix5

custom_matrix6[custom_matrix6 %in% which(ancestral_states_ape7$rates == min(ancestral_states_ape7$rates))] <- 0

# replace values of non-zeroes with correct vector of 1:n parameters
custom_matrix6[custom_matrix6 != 0] <- 1:length(custom_matrix6[custom_matrix6 != 0])

ancestral_states_ape8 <- ace(hab_pref, tree, model=custom_matrix6, type="discrete")

anova(ancestral_states_ape8, ancestral_states_ape7)
Likelihood Ratio Test Table
  Log lik. Df Df change Resid. Dev Pr(>|Chi|)  
1  -4182.5 19                                  
2  -4180.5 20         1     3.9943    0.04566 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# check AIC of each model
mod_compare <- tibble(mod = c('ER', 'SYM', 'ARD', 'ancestral_states_ape3', 'ancestral_states_ape4', 'ancestral_states_ape5', 'ancestral_states_ape6', 'ancestral_states_ape7', 'ancestral_states_ape8'),
               aic = c(AIC(ancestral_states_ape), AIC(ancestral_states_ape1), AIC(ancestral_states_ape2), AIC(ancestral_states_ape3),AIC(ancestral_states_ape4), AIC(ancestral_states_ape5), AIC(ancestral_states_ape6), AIC(ancestral_states_ape7), AIC(ancestral_states_ape8)))

arrange(mod_compare, aic)
# A tibble: 9 × 2
  mod                      aic
  <chr>                  <dbl>
1 ancestral_states_ape7  8401.
2 ancestral_states_ape8  8403.
3 ancestral_states_ape6  8411.
4 ancestral_states_ape5  8435.
5 ancestral_states_ape3  8460.
6 ARD                    8461.
7 ancestral_states_ape4  8510.
8 SYM                    8582.
9 ER                    10550.

This parameter is significant in the model so we should not remove it! We have reached the best model using this approach. That is ancestral_states_ape7.

9 Estimate ancestral states with castor::asr_mk_model()

We can do an equivalent analysis using castor::asr_mk_model(). We can then compare the transition rates and ancestral reconstruction estimates.

We will only fit the model with the best custom transition matrix to compare estimates.

Code
# rename best transition matrix for using it later
best_matrix <- custom_matrix5

rownames(best_matrix) <- colnames(best_matrix) <- sort(coding$hab_pref)

saveRDS(best_matrix, here('data/sequencing_rpoB/processed/transition_rates/best_matrix_mk.rds'))

# get dataframe of transition matrix
d_best_matrix <- ace_transition_df(ancestral_states_ape7) %>%
  left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%
  left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%
  mutate(transition_rate = round(transition_rate, 2)) 


# with castor
ancestral_state_castor <- asr_mk_model(tree, 
                                       hab_pref_num2,
                                       Ntrials = 5,
                                       rate_model = best_matrix,
                                       reroot = FALSE)

# check AIC
ancestral_state_castor$AIC
[1] 8403.211
Code
ancestral_state_castor$loglikelihood
[1] -4181.606

We can compare the transition rate estimates.

Code
# get the castor matrix
d_castor_matrix <- ancestral_state_castor$transition_matrix
rownames(d_castor_matrix) <- colnames(d_castor_matrix) <- sort(coding$hab_pref)

d_castor_matrix <- as_tibble(d_castor_matrix, rownames = 'state_1') %>%
  pivot_longer(freshwater:terrestrial, names_to = 'state_2', values_to = 'transition_rate_castor')

d_transition_compare <- left_join(d_castor_matrix, d_best_matrix) 

# plot
filter(d_transition_compare, free_param == 'yes') %>%
  ggplot(., aes(transition_rate, transition_rate_castor)) +
  geom_point(size = 2) +
  theme_bw() +
  labs(x = 'transition rate ape::ace()',
       y = 'transition rate castor::asr_mk_model') +
  geom_abline(aes(slope = 1, intercept = 0))

The transition rates from both approaches match pretty well. This is good news.Estimate ancestral states with diversitree

10 Estimate ancestral states with diversitree::fit_mk()

We can also do the equivalent analysis using diversitree. This might be beneficial because it can also be used for BiSSE models later on.

It is a bit more involved in terms of defining the constraints on the transition matrix.

Code
# all of these methods needs a likelihood function, we can build a Mkn model
lik <- make.mkn(tree, hab_pref_num2, k = max(hab_pref_num2))

# constrain correct parameters to be 0
lik_constrain <- constrain(lik, q14~0, q16~0, q17~0,
                           q27~0,
                           q34~0, q35~0, q36~0,
                           q45~0, q46~0, q47~0,
                           q51~0, q52~0, q53~0, q54~0, q57~0,
                           q63~0, q64~0,
                           q71~0, q72~0, q74~0, q75~0, q76~0)

argnames(lik_constrain)
 [1] "q12" "q13" "q15" "q21" "q23" "q24" "q25" "q26" "q31" "q32" "q37" "q41"
[13] "q42" "q43" "q56" "q61" "q62" "q65" "q67" "q73"
Code
# need to pass start values to it - can grab these from the ape::ace, but we will just pass an average rate to the model
inits <- rep(mean(ancestral_states_ape7$rates), length(argnames(lik_constrain)))

# find the maximum likelihood estimates of this model
mod_ml <- find.mle(lik_constrain, inits)

# pass inits from the ancestral states model
inits2 <- ancestral_states_ape7$rates

# mod_mcmc <- mcmc(lik_constrain, inits2, nsteps = 10, w = 1)

unname(mod_ml$par)
 [1]  1.0066263  1.2638825  0.1641365  4.1649935  8.1061017  5.3156184
 [7]  2.1069324  2.5694354  1.4328076  0.9091348  7.1411489  2.3672608
[13] 20.0996079  8.7366372  0.6853233  5.3680198 14.5781088  7.5150398
[19]  8.1012060  9.2262352
Code
# save out diversitree model
saveRDS(mod_ml, here("data/sequencing_rpoB/processed/diversitree_mk_model.rds"))

We can compare the rates estimated from ape::ace() and diversitree::find.mle() by putting putting the output of diversitree into a dataframe and combining it with our current comparison dataframe.

Code
# grab names of q matrices from diversitree
q_matrix_diversitree <- names(mod_ml$par)

# create dataframe with these
q_matrix <- tibble(q_matrix_diversitree = q_matrix_diversitree) %>%
  # grab first and second elements that are the habitat codings
  mutate(state_1_num = as.numeric(substr(q_matrix_diversitree, 2,2)),
         state_2_num = as.numeric(substr(q_matrix_diversitree, 3,3)),
         transition_rate_diversitree = unlist(mod_ml$par)) %>%
  left_join(select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2)) %>%
  left_join(select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2)) %>%
  select(-state_1_num, -state_2_num, -q_matrix_diversitree)

# combine into comparison dataframe
d_transition_compare <- left_join(d_transition_compare, q_matrix) %>%
  select(state_1, state_2, transition_rate, transition_rate_castor, transition_rate_diversitree) %>%
  mutate(across(starts_with('transition'), round, 2))

# plot to see similarity
ggplot(d_transition_compare, aes(transition_rate, transition_rate_diversitree)) +
  geom_abline(aes(intercept = 0, slope = 1)) +
  geom_point(size = 2) +
  theme_bw()

We can see the estimates are equivalent! This is great news again.

11 Estimate ancestral states using stochastic character mapping using phytools::make.simmap()

An alternative approach is to do stochastic mapping of the ancestral states. What this does is it calculates the conditional likelihood for each character state at each node of the tree, including the root, and then simulate ancestral states at each internal node by sampling from the posterior distribution of states.

We need to do some initial changes to our transition matrix which we can pass to make_simmap() so the \(M_{k}\) model is not fitted by this model. Importantly we need to make sure the rows add up to 0, so we need to make the diagonal value of the transition matrix be the negative of the sum of each row.

Code
# rename best transition matrix for using it later
best_matrix <- custom_matrix5

rownames(best_matrix) <- colnames(best_matrix) <- sort(coding$hab_pref)

# get dataframe of transition matrix
d_best_matrix <- ace_transition_df(ancestral_states_ape7) %>%
  left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%
  left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%
  mutate(transition_rate = round(transition_rate, 2)) 


# get best matrix
best_matrix_estimates <- select(d_best_matrix, state_1, state_2, transition_rate) %>%
  spread(state_2, transition_rate) %>%
  column_to_rownames(var = 'state_1') %>%
  mutate(across(everything(), replace_na, 0)) %>%
  as.matrix()

# make diagonal values make things sum to 0
diag(best_matrix_estimates) <- -rowSums(best_matrix_estimates)

We can now run make.simmap() on our tree. We will run it for 100 simulations for now to reduce object space.

Code
# do stochastic mapping of character traits using phytools
# feed in the best transition matrix previously found using other methods
ancestral_states_phytools <- make.simmap(tree, hab_pref, nsim = 100, Q = best_matrix_estimates)

It has finished! We can count the number of transitions between each state using describe.simmap(). We can then plot these to see the distribution of numbers of transitions.

Code
# summarise number of switches between states and time spent in each state
simmap_summary <- describe.simmap(ancestral_states_phytools, plot=FALSE)

# coerce transitions into dataframe
d_transitions <- as.data.frame(simmap_summary$count, col.names = colnames(simmap_summary$count)) %>%
  pivot_longer(cols = c(everything(), -N), names_to = 'transition', values_to = 'n_transitions') %>%
  separate(transition, c('state_1', 'state_2'), sep = ',', remove = FALSE) %>%
  mutate(label = gsub(',', ' -> ', transition))

# plot these
group_by(d_transitions, transition) %>%
  filter(max(n_transitions) > 0) %>%
  mutate(average = mean(n_transitions)) %>%
  ungroup() %>%
  mutate(label = fct_reorder(label, desc(average))) %>%
  ggplot() +
  geom_histogram(aes(n_transitions), fill = 'white', col = 'black', binwidth = function(x) (max(x)-min(x))/nclass.FD(x)) +
  facet_wrap(~ label, scales = 'free_x') +
  theme_bw() +
  labs(title = 'Distribution of number of transitions between states',
  subtitle = 'Facets are ordered by common transitions',
  x = 'Number of transitions',
  y = 'Count')

We can see the number of transitions between states is dominated by the terrestrial and freshwater environments, with relatively few transitions involving the marine mud environment.

We can also look at how was spent in each state in each character map iteration.

Code
# coerce time into dataframe
d_time <- as.data.frame(simmap_summary$times, col.names = colnames(simmap_summary$times)) %>%
  mutate(n = 1:n()) %>%
  select(-total) %>%
  pivot_longer(cols = c(everything(), -n), names_to = 'state', values_to = 'time') %>%
  group_by(n) %>%
  mutate(prop = time/sum(time)) %>%
  ungroup()

# plot these
group_by(d_time, state) %>%
  mutate(average = mean(prop)) %>%
  ungroup() %>%
  mutate(state = fct_reorder(state, desc(average))) %>%
  ggplot() +
  geom_histogram(aes(prop), fill = 'white', col = 'black', binwidth = function(x) (max(x)-min(x))/nclass.FD(x)) +
  facet_wrap(~ state, scales = 'free_x') +
  theme_bw() +
  labs(title = 'Proportion of time spent in each state',
  subtitle = 'Facets are ordered in terms of time spent',
  x = 'Proportion of time spent in state',
  y = 'Count')

From this plot, we can see that the tree is dominated by four states: freshwater+terrestrial, terrestrial, marine mud, and freshwater. However, what is interesting is that even though marine mud is the third most common state, it is involved in very relatively few transitions. Transitions out of being a marine mud specialist only occur into freshwater+marine mud, and this is the second least common transition.

We can also compare the ancestral state estimates garnered from the best ape::ace() model and those from the average stochastic character map. These are likely to be extremely similar given we fixed the Q matrix in our stochastic character mapping process to save time.

Code
# grab ancestral state from simmap
d_acr_phytools <- as.data.frame(simmap_summary$ace, col.names = colnames(simmap_summary$ace), row.names = row.names(simmap_summary$ace)) %>%
  rownames_to_column(., var = 'node') %>%
  pivot_longer(cols = all_of(coding$hab_pref), values_to = 'probability_phytools', names_to = 'habitat_preference')

# grab out marginal likelihood for each node from ape::ace()
d_acr_ape <- ancestral_states_ape6$lik.anc %>%
  as_tibble(rownames = 'node') %>%
  pivot_longer(cols = all_of(coding$hab_pref), values_to = 'probability_ape', names_to = 'habitat_preference')

# grab out marginal likelihood for each node from castor::asr_mk_model()
d_acr_castor <- ancestral_state_castor$ancestral_likelihoods
colnames(d_acr_castor) <- sort(coding$hab_pref)
rownames(d_acr_castor) <- rownames(ancestral_states_ape6$lik.anc)

d_acr_castor <- as_tibble(d_acr_castor, rownames = 'node') %>%
  pivot_longer(cols = all_of(coding$hab_pref), values_to = 'probability_castor', names_to = 'habitat_preference')

# grab out marginal likelihood for each node from diversitree::find.mle()
d_acr_diversitree <- asr.marginal(lik_constrain, unname(mod_ml$par)) %>%
  t() %>%
  as.data.frame(., col.names = colnames(ancestral_states_ape6$lik.anc), row.names = rownames(ancestral_states_ape6$lik.anc)) 

colnames(d_acr_diversitree) <- colnames(ancestral_states_ape6$lik.anc)

d_acr_diversitree <- d_acr_diversitree %>%
  rownames_to_column(var = 'node') %>%
  pivot_longer(cols = all_of(coding$hab_pref), values_to = 'probability_diversitree', names_to = 'habitat_preference')

# look at similarity between ace, castor, and phytools approaches
d_acr_summary <- reduce(list(d_acr_ape, d_acr_phytools, d_acr_castor, d_acr_diversitree), left_join)

# can plot all of these reconstructions to see if they are doing similar things
# will use ggpairs

ggpairs(d_acr_summary, 
  columns = c('probability_diversitree', 'probability_ape', 'probability_castor', 'probability_phytools'),
  mapping = ggplot2::aes(colour = habitat_preference),
  lower = list(continuous = wrap("smooth", se = FALSE, alpha = 0.3, size=0.1))) +
  theme_bw(base_size = 14)

Stochastic character mapping and the maximum likelihood approach estimate ancestral states in slightly different ways. In ape::ace(), ancestral states are inferred using marginal state reconstruction, whereas the estimates from the stochastic character map use the joint reconstruction (I think).

However, if you force the Q matrix to be maximum likelihood estimates then the estimates for the ancestral states should converge. This is what we see happening here. The diversitree estimates based on the marginal state reconstruction are equivalent to phytools. Something strange is going on with the castor estimates that I cannot work out, but three of our methods are extremely close with each other. This is good!

12 Looking at the best model for ancestral state reconstruction.

We can plot the best transition matrix using ggplot2.

Code
# rename best transition matrix for using it later
best_matrix <- custom_matrix5

# get dataframe of transition matrix
d_best_matrix <- ace_transition_df(ancestral_states_ape7) %>%
  left_join(., select(coding, state_1 = hab_pref, state_1_num = hab_pref_num2, state_1_label = hab_pref_axis)) %>%
  left_join(., select(coding, state_2 = hab_pref, state_2_num = hab_pref_num2, state_2_label = hab_pref_axis)) %>%
  mutate(transition_rate = round(transition_rate, 2)) 


# make heatmap of changes
ggplot(d_best_matrix, aes(forcats::fct_reorder(state_2_label, state_2_num), forcats::fct_reorder(state_1_label, desc(state_1_num)))) +
  geom_tile(aes(alpha = transition_rate, col = free_param), width = 0.9, height = 0.9, size = 1.1) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(),
  legend.position = 'none',
  axis.text.x.top = element_text(angle = 90, vjust = 0.5),
  plot.title.position = "plot") +
  scale_alpha_continuous(range = c(0, 0.6)) +
  geom_text(aes(label = transition_rate), size = MicrobioUoE::pts(10)) +
  scale_x_discrete(position = 'top', labels = scales::label_wrap(13)) +
  scale_y_discrete(position = 'left', labels = scales::label_wrap(13)) +
  labs(y = 'state 1',
  x = 'state 2',
  title = paste('all rates different with', length(best_matrix[best_matrix != 0]), 'free parameters', sep = ' ')) +
  coord_fixed() +
  scale_color_manual(values = c('black', 'red'))

Code
ggsave(here('plots/sequencing_rpoB/analyses/transition_matrix.png'), last_plot(), height = 6, width = 8)

From this transition matrix, we can see some really cool patterns.

  • The best model has 17 free parameters, down from 42 in the all rates different model.

  • It is much more common to transition to being more specialist than being more generalist.

    • generalist -> freshwater, generalist -> freshwater/terrestrial, generalist -> margine/terrestrial have high rates

    • freshwater/marine -> freshwater has high rate

    • marine/terrestrial -> marine has high rate

    • Only freshwater/terrestrial has a transition rate INTO generalist

  • Zero rates generally map to transitions across habitats (i.e. freshwater -> terrestrial).

    • This is only non-zero in 1 out of 12 cases.

We can have a look at these habitat preferences and their transition rates in a network, to visualise key stepping stones between states and highly connected trait states.

We will do this using ggraph. I have not implemented this yet.

First need to change our transition matrix into a network.

Code
# set colours
cols_hab <- met.brewer('Austria', n = 7)
names(cols_hab) <- c('mud_and_shore', 'freshwater', 'terrestrial', 'freshwater:terrestrial', 'generalist', 'mud_and_shore:terrestrial', 'freshwater:mud_and_shore')
hab_labels <- c('marine mud', 'freshwater', 'terrestrial', 'freshwater + terrestrial', 'generalist', 'marine mud + terrestrial', 'freshwater + marine mud')

# calculate mean time spent in each state
d_timespent <- group_by(d_time, state) %>%
  summarise(mean = mean(prop), .groups = 'drop')

# turn transition matrix into network to plot
d_network <- as_tbl_graph(select(d_best_matrix, state_1, state_2, transition_rate)) %>%
  activate(edges) %>%
  filter(!is.na(transition_rate) & transition_rate > 0) %>%
  activate(nodes) %>%
  left_join(., select(d_timespent, name = state, mean)) %>%
  mutate(label = gsub(':', '/', name),
  label = gsub('_', ' ', label),
  label = gsub('mud and shore', 'marine mud', label),
  order = c(1, 2, 6, 7, 3, 4, 5),
  hab1 = gsub(':.*.', '', name),
  hab2 = gsub('.*:', '', name)) %>%
  arrange(order)

p <- ggraph(d_network, layout = 'linear', circular = TRUE) + 
  geom_edge_fan(aes(alpha = transition_rate, 
                width = transition_rate),
                arrow = arrow(length = unit(4, 'mm')),
                end_cap = circle(10, 'mm'),
                start_cap = circle(10, 'mm')) + 
  geom_node_point(aes(size = mean,
                col = name)) +
  theme_void() +
  #geom_node_label(aes(label = label, x=xmin), repel = TRUE) +
  scale_size(range = c(2,20)) +
  scale_edge_width(range = c(0.5, 2)) +
  scale_color_manual('Habitat preference', values = cols_hab, labels = c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +
scale_fill_manual('Habitat preference', values = cols_hab, labels = c('terrestrial', 'freshwater', 'marine mud', 'generalist'))

# grab data for points
point_data <- p$data %>%
  select(x, y, label) %>%
  mutate(nudge_x = ifelse(x < 0, -0.5, 0.5),
  nudge_y = ifelse(y < 0, -0.3, 0.3),
  label = gsub('/', '/\n', label))

p + geom_label(aes(nudge_x + x, nudge_y+y, label = label), point_data, size = MicrobioUoE::pts(18)) +
  theme(legend.position = 'none',
  panel.background = element_rect(fill = 'white', colour = 'white')) +
  coord_cartesian(clip = "off") +
  xlim(c(min(point_data$x) + min(point_data$nudge_x) - 0.3), max(point_data$x) + max(point_data$nudge_x) + 0.3) +
  ylim(c(min(point_data$y) + min(point_data$nudge_y) - 0.1), max(point_data$y) + max(point_data$nudge_y) + 0.1)

Code
#ggsave(here('sequencing_rpoB/plots/analyses/transition_plot.pdf'), last_plot(), height = 6, width = 8)
ggsave(here('plots/sequencing_rpoB/analyses/transition_plot.png'), last_plot(), height = 6, width = 8)

This plot is AMAZING and has loads of useful biological interpretation.

  • Marine mud is a common state (larger point), but has very low transition rates away from this state (and can only move into freshwater + marine mud)
  • Generalist is a rare state and has high rates transitioning away from it.
  • Rates going from a more polymorphic state to a simpler, more specialised state are generally higher than rates going from a monomorphic to polymorphic state.
  • Freshwater and its associated polymorphic traits (e.g. freshwater + terrestrial, freshwater + marine mud) are important states that facilitate a lot of transitions between other states.
  • Only 1 out of 6 transitions from monomorphic states to each other is possible in the best model (freshwater -> marine mud).

13 Plot ancestral state reconstructions using ggtree

We can plot the ancestral states using methods from ggtree. For this we need seven colours, one for each state. This is because it is really important to know whether we are confident the ancestral state reconstruction thinks we are in a polymorphic trait, or whether it is stuck between two (or three), monomorphic states.

The idea is that habitat transitions may result in a change in rate of speciation/diversification, so we can plot the nodes where transitions are most likely to occur over the tree to see if they align with any habitat transitions.
We tried this on the whole tree, but it broke my Mac so instead we are doing it on node 4416, which was identified as a parent of nodes which contained a high density of potential transitions.

Code
# grab out marginal likelihood for each node
# will use results from the stochastic character mapping

# add new colours to cols_hab
cols_hab2 <- c(cols_hab, '#00ffdc', '#032a63', '#1e5302')
names(cols_hab2) <- c(names(cols_hab), 'freshwater:terrestrial', 'freshwater:mud_and_shore', 'mud_and_shore:terrestrial')

# lets plot the nodepies for a subset of the tree. We create a plotly of the bamm transitions and identified a clade with lots of transitions in
# lets try on a subset of the data for now. Lets try using all tips from node 4416, where we are pretty confident that a transition occurred in at some point
# replace bootstrap value with node number so we can find it again

# rename tree
tree2 <- tree

# keep tree data for nodes
d_tree <- as_tibble(tree2) %>%
  filter(!str_detect(label, 'otu'))

# check node label is the same as the order in the dataframe
sum(d_tree$label == tree2$node.label) == length(tree2$node.label)

# replace node label with node number
tree2$node.label <- d_tree$node

# subset tree
tree_sub <- extract.clade(tree2, 4416)

# subset tree sub data to just keep nodes
d_tree_sub <- as_tibble(tree_sub) %>%
  filter(!str_detect(label, 'otu'))

# tree_sub$node.label <- filter(d_tree, node %in% d_tree_sub$label)

# replace node labels in acr dataset with new node numbers
d_acr_phytools_sub <- filter(d_acr_phytools, node %in% d_tree_sub$label) %>%
  pivot_wider(., names_from = 'habitat_preference', values_from = 'probability_phytools')
d_acr_phytools_sub$node <- d_tree_sub$node

# grab the relevant shift nodes and convert them to the right number for the subsetted tree
shiftnodes_sub <- shiftnodes[shiftnodes %in% d_tree_sub$label]
shiftnodes_sub <- tibble(shiftnode_original = as.character(shiftnodes_sub)) %>%
  left_join(select(d_tree_sub, node, shiftnode_original = label))

# create node pies for each node
# https://github.com/YuLab-SMU/ggtree/issues/419#issuecomment-877563385

# cols_hab[1:3][sort(names(cols_hab[1:3]))]

pies <- nodepie(d_acr_phytools_sub, cols = 2:ncol(d_acr_phytools_sub), color = cols_hab2[sort(names(cols_hab2))], alpha = 1)
d_pie <- tibble::tibble(node=as.numeric(d_acr_phytools_sub$node), pies=pies)

# plot node pies on the tree with transition position
tree_plot_sub <- ggtree(tree_sub, layout = 'circular', branch.length = 'none') %<+% filter(d_meta, tip_label %in% tree_sub$tip.label) +
new_scale_color() +
geom_tippoint(aes(x=x+1.5, col = habitat_preference), size = 2, position = position_jitter(width = 1, height = 0)) +
scale_color_manual('Habitat preference', values = cols_hab2, labels = c('terrestrial', 'freshwater', 'marine mud', 'generalist', 'freshwater + terrestrial', 'freshwater + marine mud', 'marine mud + terrestrial')) +
scale_fill_manual('Habitat preference', values = cols_hab2, labels = c('terrestrial', 'freshwater', 'marine mud', 'generalist', 'freshwater + terrestrial', 'freshwater + marine mud', 'marine mud + terrestrial')) +
  geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col = 'red', shape = 21, fill = NA,size=8)

# plot tree
p_acr <- tree_plot_sub %<+% d_pie +
  geom_plot(data=td_filter(!isTip), mapping=aes(x=x,y=y, label=pies), vp.width=0.04, vp.height=0.04, hjust=0.5, vjust=0.5)

ggsave(here('sequencing_rpoB/plots/analyses/acr_plot.png'), p_acr, height = 10, width = 10)
ggsave(here('sequencing_rpoB/plots/analyses/acr_plot.pdf'), p_acr, height = 10, width = 10)

p_acr

# change probabilities to just be marine, freshwater or terrestrial
d_acr_single <- filter(d_acr_phytools, habitat_preference %in% c('terrestrial', 'mud_and_shore', 'freshwater'))
d_acr_double <- filter(d_acr_ape, str_count(habitat_preference, ':') == 1) %>%
  separate_rows(habitat_preference, sep = ':') %>%
  mutate(probability_ape = probability_ape/2)

d_acr_general <- filter(d_acr_ape, habitat_preference == 'generalist') %>%
  group_by(node) %>%
  do(data.frame(habitat_preference = c('terrestrial', 'mud_and_shore', 'freshwater'), probability_ape = .$probability_ape/3)) %>%
  ungroup()

# combine to have just three habitats
d_acr_ape2 <- bind_rows(d_acr_single, d_acr_double, d_acr_general) %>%
  group_by(node, habitat_preference) %>%
  summarise(probability_ape = sum(probability_ape), .groups = 'drop') %>%
  pivot_wider(names_from = 'habitat_preference', values_from = 'probability_ape')

pies <- nodepie(d_acr_phytools_sub, cols = 2:4, color = cols_hab[1:3][sort(names(cols_hab[1:3]))], alpha = 1)
d_pie <- tibble::tibble(node=as.numeric(d_acr_phytools_sub$node), pies=pies)

# plot node pies on the tree with transition position
tree_plot_sub <- ggtree(tree_sub, layout = 'circular', branch.length = 'none') %<+% filter(d_meta, tip_label %in% tree_sub$tip.label) +
new_scale_color() +
geom_tippoint(aes(x=x+1.5, col = hab1, fill = hab2), size = 2, shape = 21, position = position_jitter(width = 1, height = 0)) +
scale_color_manual('Habitat preference', values = cols_hab, labels = c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +
scale_fill_manual('Habitat preference', values = cols_hab, labels = c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +
  geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col = 'red', shape = 21, fill = NA,size=8)

We can see that lots of the shift changes are associated with nodes where the model is confident in the ancestral state (either freshwwater + terrestrial or marine mud). To me there is not clear evidence that the nodes where shifts are thought to occur are in areas where a transition is likely to have occurred.

An alternative approach here is to not only plot each monomorphic state in the pie charts at the node, and split each polymorphic state into the probability of it either being marine mud, terrestrial, or freshwater. I do not think I favour this approach in the long-term, but it produces a graph that is slightly more consistent with our previous colour schemes.

Code
# change probabilities to just be marine, freshwater or terrestrial
d_acr_single <- filter(d_acr_phytools, habitat_preference %in% c('terrestrial', 'mud_and_shore', 'freshwater'))

d_acr_double <- filter(d_acr_phytools, str_count(habitat_preference, ':') == 1) %>%
  separate_rows(habitat_preference, sep = ':') %>%
  mutate(probability_phytools = probability_phytools/2)

d_acr_general <- filter(d_acr_phytools, habitat_preference == 'generalist') %>%
  group_by(node) %>%
  do(data.frame(habitat_preference = c('terrestrial', 'mud_and_shore', 'freshwater'), probability_phytools = .$probability_phytools/3)) %>%
  ungroup()

# combine to have just three habitats
d_acr2 <- bind_rows(d_acr_single, d_acr_double, d_acr_general) %>%
  group_by(node, habitat_preference) %>%
  summarise(probability_phytools = sum(probability_phytools), .groups = 'drop') %>%
  filter(., node %in% d_tree_sub$label) %>%
  pivot_wider(names_from = 'habitat_preference', values_from = 'probability_phytools')

d_acr2$node <- d_tree_sub$node

pies <- nodepie(d_acr2, cols = 2:4, color = cols_hab[1:3][sort(names(cols_hab[1:3]))], alpha = 1)
d_pie <- tibble::tibble(node=as.numeric(d_acr2$node), pies=pies)

# plot node pies on the tree with transition position
tree_plot_sub <- ggtree(tree_sub, layout = 'circular', branch.length = 'none') %<+% filter(d_meta, tip_label %in% tree_sub$tip.label) +
new_scale_color() +
geom_tippoint(aes(x=x+1.5, col = hab1, fill = hab2), size = 2, shape = 21, position = position_jitter(width = 1, height = 0)) +
scale_color_manual('Habitat preference', values = cols_hab, labels = c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +
scale_fill_manual('Habitat preference', values = cols_hab, labels = c('terrestrial', 'freshwater', 'marine mud', 'generalist')) +
  geom_point2(aes(subset=(node %in% shiftnodes_sub$node)), col = 'red', shape = 21, fill = NA,size=8)

# plot tree
p_acr <- tree_plot_sub %<+% d_pie +
  geom_plot(data=td_filter(!isTip), mapping=aes(x=x,y=y, label=pies), vp.width=0.04, vp.height=0.04, hjust=0.5, vjust=0.5)

ggsave(here('sequencing_rpoB/plots/analyses/acr_plot2.png'), p_acr, height = 10, width = 10)
ggsave(here('sequencing_rpoB/plots/analyses/acr_plot2.pdf'), p_acr, height = 10, width = 10)

p_acr

14 How to compare BAMM results to ancestral state reconstructions

A key question here is how to link the BAMM analysis to these ancestral state reconstructions. One idea I had was to compare all the nodes which have rate shifts to the rest of the nodes in terms of marginal probabilities of ancestral states.

We can pretty easily do this by filtering the data on marginal state probabilities by whether they have been identified as a shift node or not.

We probably need to control for time from tip in this analysis somehow. Obviously deeper tips are more likely to be less certain in the evolutionary model.

This plot shows a density plot for the marginal probability of each node for each state (habitat preference) split by whether the node was a shift node or not.

If there is no difference in ancestral state reconstruction between shift nodes and non-shift nodes, all the distributions should overlay on each other.

Code
# check whether internal node has a shift in or not
d_acr_phytools <- mutate(d_acr_phytools, shift = ifelse(node %in% shiftnodes, 'shift', 'noshift'))

ggplot(d_acr_phytools, aes(probability_phytools)) +
  geom_density(aes(col = shift, fill = shift, probability_phytools, ..scaled..), alpha = 0.1) +
  facet_wrap(~habitat_preference, scales = 'free') +
  scale_fill_manual(values = c('black', 'red')) +
  scale_color_manual(values = c('black', 'red')) +
  theme_bw()

Most of the panels overlay each other pretty consistently. However a couple of them do not (i.e. terrestrial and freshwater + terrestrial). The problem is I do not really understand how to interpret this at the moment and think it is kind of meaningless. For example, the panel for freshwater + terrestrial says that shift nodes are predicted to contain more freshwater + terrestrial. But this is true even when the probability of being freshwater + terrestrial in that node is 0.25.

Think how to cleverly look at these things is something I would really like to discuss further.